Introduction to Statistical Methods
Department of Educational Psychology
| Student | Q1 | Q2 | Q3 | Q4 | Q5 |
|---|---|---|---|---|---|
| Student 1 | 1 | 0 | 1 | 1 | 0 |
| Student 2 | 0 | 1 | 0 | 1 | 1 |
| Student 3 | 1 | 1 | 1 | 0 | 0 |
| Student 4 | 0 | 0 | 1 | 0 | 1 |
| Student 5 | 1 | 1 | 0 | 1 | 1 |
Stem-and-leaf graphs, also known as stemplots, are technically a tabular method of representing data, but they also let us visually inspect the distribution of the data.
Stemplots are a valid choice for representing numeric, quantitative data for a single variable
Each value is split into a stem (the leading digits) and a leaf, which contains the final significant digit (in practice this is usually just the last digit)
Practically, stemplots are useful for showing the distribution, skew, and frequency of values
Example of test scores 0 - 100: 2, 3, 5, 6, 9, 11, 14, 15, 16, 20, 21, 23, 27, 30, 32, 38, 41, 44, 46, 53, 55, 59, 60, 62, 67, 74, 79, 81, 85, 90
| Stem | Leaf |
|---|---|
| 0 | 2 3 5 6 9 |
| 1 | 1 4 5 6 |
| 2 | 0 1 3 7 |
| 3 | 0 2 8 |
| 4 | 1 4 6 |
| 5 | 3 5 9 |
| 6 | 0 2 7 |
| 7 | 4 9 |
| 8 | 1 5 |
| 9 | 0 |
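The table above can be reproduced with a few lines of code. This is a minimal sketch, assuming non-negative integer scores where the tens digit is the stem and the ones digit is the leaf; the function name `stemplot` is made up for illustration.

```python
from collections import defaultdict

def stemplot(values):
    """Group non-negative integers by stem (tens) and leaf (ones digit)."""
    rows = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)  # e.g. 27 -> stem 2, leaf 7
        rows[stem].append(leaf)
    return dict(rows)

scores = [2, 3, 5, 6, 9, 11, 14, 15, 16, 20, 21, 23, 27, 30, 32, 38,
          41, 44, 46, 53, 55, 59, 60, 62, 67, 74, 79, 81, 85, 90]

plot = stemplot(scores)
for stem in sorted(plot):
    # One row per stem, leaves separated by spaces
    print(f"{stem} | {' '.join(str(leaf) for leaf in plot[stem])}")
```

Longer rows mean more values share that stem, which is why the printed table reads like a sideways histogram.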
Talking about frequency inherently involves talking about the distribution of the data, that is, how it is spread out
While the above plots are fine for more discrete data, there are several plot types that lend themselves especially well to laying out continuous data
A histogram, at first glance, looks much like the bar plot described earlier.
However, rather than using individual discrete points or labels, histograms group values by a defined interval (also called a class or bin) and count the frequency of values within each bin
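The binning step can be sketched in code. This example reuses the test scores from the stemplot example and assumes a bin width of 20, chosen only for illustration; the function name `histogram_counts` is hypothetical, and the text "bars" stand in for a real plot.

```python
from collections import Counter

def histogram_counts(values, bin_width):
    """Count how many values fall in each bin [start, start + bin_width)."""
    # Integer division maps each value to the lower edge of its bin
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))

scores = [2, 3, 5, 6, 9, 11, 14, 15, 16, 20, 21, 23, 27, 30, 32, 38,
          41, 44, 46, 53, 55, 59, 60, 62, 67, 74, 79, 81, 85, 90]

counts = histogram_counts(scores, bin_width=20)
for start, count in counts.items():
    # Label each bin by its integer range and draw one '#' per value
    print(f"{start:>3}-{start + 20 - 1:<3} {'#' * count}")
```

Changing `bin_width` changes how coarse or fine the picture of the distribution is, so it is worth trying more than one width.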
\[ IQR = Q_3 - Q_1 \]
\[ i = \frac{k}{100}*(n + 1) \]
Dataset: 5, 6, 7, 8, 9 (ordered smallest to largest)
Prompt: What value corresponds to the 70th percentile?
\[ i = \frac{70}{100}*(5 + 1) \]
\[ i = 0.70*6 \]
\[ i = 4.2 \]
Because \(i = 4.2\) falls between two ranks, we round it down and up to the 4th and 5th ranks, which correspond to the values 8 and 9 in the dataset. Their average, 8.5, is the 70th percentile
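The same procedure can be written as a short function. This is a sketch of the rank formula \(i = \frac{k}{100}*(n + 1)\) with the round-down/round-up averaging used in this example; statistical software often uses other interpolation rules and may give slightly different answers. The function name `percentile_value` and the clamping of ranks to the range 1 to \(n\) are assumptions added for illustration.

```python
import math

def percentile_value(sorted_values, k):
    """Value at the k-th percentile using the rank i = k/100 * (n + 1)."""
    n = len(sorted_values)
    i = k * (n + 1) / 100
    # Assumption: keep ranks inside 1..n for very small or very large k
    lower = max(1, min(n, math.floor(i)))
    upper = max(1, min(n, math.ceil(i)))
    # Ranks are 1-based, list indexes are 0-based
    return (sorted_values[lower - 1] + sorted_values[upper - 1]) / 2

data = [5, 6, 7, 8, 9]  # must already be ordered smallest to largest
print(percentile_value(data, 70))
```

When \(i\) is a whole number, the lower and upper ranks coincide and the function simply returns the value at that rank.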
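\[ \frac{x+0.5y}{n}*100 \]
where \(x\) is the number of values below the value of interest, \(y\) is the number of values equal to it, and \(n\) is the sample size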
Dataset: 5, 6, 7, 8, 9 (ordered smallest to largest)
Prompt: At what percentile is the value 8?
\[ \frac{3+0.5(1)}{5}*100 \]
\[ \frac{3.5}{5}*100 \]
\[ 70 \rightarrow \text{70th percentile} \]
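The reverse calculation, going from a value to its percentile, can be sketched the same way. Here the count of values below the target plays the role of \(x\) and the count of values equal to it plays the role of \(y\); the function name `percentile_rank` is made up for illustration.

```python
def percentile_rank(values, target):
    """Percentile of `target`: (below + 0.5 * equal) / n * 100."""
    below = sum(1 for v in values if v < target)
    equal = sum(1 for v in values if v == target)
    return 100 * (below + 0.5 * equal) / len(values)

data = [5, 6, 7, 8, 9]
print(percentile_rank(data, 8))  # 3 values below, 1 equal, n = 5
```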
As a simple check, always ensure your values are arranged smallest to largest when calculating percentiles, quartiles, and the median
Smaller percentiles/quartiles \(\rightarrow\) smaller values in the data set
\(Q_2\) = 50th percentile = median
\[ \bar{x} = \frac{x_1 + x_2 + x_3 + \dots + x_n}{n} \]
\[ s^2 = \frac{\sum{(x-\bar{x})^2}}{n-1} \]
\[ s = \sqrt{\frac{\sum{(x-\bar{x})^2}}{n-1}} \]
\[ s = \sqrt{s^2} \]
While not necessary with computers and calculators, it can be useful to work out statistics “by hand” to learn how they work. For example:
Dataset: 1, 2, 3, 4, 5
Size of sample: \(n = 5\)
Sample Mean:
\[ \bar{x} = \frac{x_1 + x_2 + x_3 + \dots + x_n}{n} \]
\[ 3 = \frac{1 + 2 + 3 + 4 + 5}{5} \]
Sample Variance:
\[ s^2 = \frac{\sum{(x-\bar{x})^2}}{n-1} \]
Hint: whenever you see a \(\sum\) sign, follow this tabular procedure
| \(x\) | \(x - \bar{x}\) | \((x - \bar{x})^2\) |
|---|---|---|
| 1 | 1 - 3 = -2 | (-2)^2 = 4 |
| 2 | 2 - 3 = -1 | (-1)^2 = 1 |
| 3 | 3 - 3 = 0 | 0^2 = 0 |
| 4 | 4 - 3 = 1 | 1^2 = 1 |
| 5 | 5 - 3 = 2 | 2^2 = 4 |
| Sum | | 10 |
\[ 2.5 = \frac{10}{5-1} \]
Sample Standard Deviation:
\[ s = \sqrt{s^2} \]
\[ 1.58 \approx \sqrt{2.5} \]
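The by-hand steps translate directly into code, which is a handy way to check the arithmetic. This is a minimal sketch with made-up function names; the standard library's `statistics.variance` and `statistics.stdev` also use the \(n - 1\) denominator and should agree with it.

```python
import math

def sample_mean(xs):
    """Sum of the values divided by the sample size n."""
    return sum(xs) / len(xs)

def sample_variance(xs):
    """Sum of squared deviations from the mean, divided by n - 1."""
    mean = sample_mean(xs)
    squared_deviations = [(x - mean) ** 2 for x in xs]  # the table's last column
    return sum(squared_deviations) / (len(xs) - 1)

def sample_sd(xs):
    """Square root of the sample variance."""
    return math.sqrt(sample_variance(xs))

data = [1, 2, 3, 4, 5]
print(sample_mean(data))
print(sample_variance(data))
print(round(sample_sd(data), 2))
```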
Much like describing a specific value with a percentile, we may want to describe how far away from the mean a certain point is, and we can do so using the standard deviation
We can do this with what are called z-scores, using the following formula for a sample:
\[ z = \frac{x - \bar{x}}{s} \]
By hand (pulling from the last example, for the value 2):
\[ -0.63 \approx \frac{2 - 3}{1.58} \]
We could then say that the value of 2 is 0.63 standard deviations below the mean of 3 (the negative sign of the z-score indicates the direction)
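As a quick check of the z-score arithmetic, here is a small sketch using Python's standard `statistics` module, whose `stdev` function computes the sample standard deviation with the \(n - 1\) denominator. The function name `z_score` is made up for illustration.

```python
import statistics

def z_score(x, data):
    """Number of sample standard deviations that x lies from the sample mean."""
    return (x - statistics.mean(data)) / statistics.stdev(data)

data = [1, 2, 3, 4, 5]
print(round(z_score(2, data), 2))  # negative: 2 is below the mean
print(round(z_score(3, data), 2))  # zero: 3 is the mean
```

A negative z-score means the value sits below the mean, a positive one means it sits above, and zero means it equals the mean.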
Understanding our data starts with describing it, which we can accomplish through both informative graphs and statistics
The various graphs show slightly different information, and multiple options may be used simultaneously to more thoroughly show the characteristics of the data
Measures of central tendency and dispersion can succinctly describe the center of data, and how spread out it is, respectively
The statistics we calculate on our sample are meant to accurately estimate the parameters of the population distribution, but that only works if our sample is representative and our data are unbiased!
Outliers and skew will return later when we talk about their impact on inferential statistics, but don’t worry too much about that yet!
Module 2 Lecture - Descriptive Statistics || Introduction to Statistical Methods